12/15/23
Goal: build a flexible yet interpretable model like:
\[\begin{equation} \boldsymbol{y} = f(\boldsymbol{x}_1, ..., \boldsymbol{x}_P) + \boldsymbol{e} \end{equation}\]
Generalized linear models (GLMs), e.g. linear or logistic regression, are interpretable and can be flexible, but you need to decide:
- which predictors to include
- how to model nonlinear relationships (e.g. transformations, polynomial terms)
- which interactions to include
MARS (multivariate adaptive regression splines) automatically determines all of this for you 😃
MARS uses simple piecewise linear functions (“splines”) that can approximate complex relationships
Question: what are 3 things you need to determine when fitting a piecewise linear spline?
Suppose knot \(t = 0.5\). Here is what the hinge pair \(\beta_{L}(t-x)_+ + \beta_{R}(x-t)_+\) looks like for various coefficients:
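To make this concrete, here is a minimal base-R sketch of a hinge pair (the `pos` helper is just an illustrative name for \((\cdot)_+\), not from any package):

```r
# Hinge ("hockey-stick") basis functions with knot t:
# (x - t)_+ is zero to the left of t and rises linearly to the right;
# (t - x)_+ is its mirror image.
pos <- function(z) pmax(0, z)

t <- 0.5
x <- seq(0, 1, by = 0.25)

right <- pos(x - t)   # (x - t)_+
left  <- pos(t - x)   # (t - x)_+

# A hinge pair with coefficients beta.L, beta.R gives a piecewise
# linear function with a bend at the knot t:
beta.L <- -1
beta.R <- 2
y <- beta.L * left + beta.R * right
y  # -0.5 -0.25 0 0.5 1
```

Varying `beta.L` and `beta.R` changes the slopes on each side of the knot independently, which is what lets MARS bend the fit only where needed.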
…the regression surface is built up parsimoniously, using nonzero components locally—only where they are needed. This is important, since one should “spend” parameters carefully in high dimensions, as they can run out quickly [“curse of dimensionality”]. The use of other basis functions such as polynomials, would produce a nonzero product everywhere, and would not work as well (Hastie et al. 2009)
How does MARS compare to lm/glm? Is it nonparametric? (See the `earth` package notes: http://www.milbo.org/doc/earth-notes.pdf)
\[\begin{equation} \boldsymbol{C} = \big\{(X_j - t)_+, (t - X_j)_+\big\} \\ t \in \{x_{1j}, x_{2j}, ..., x_{Nj}\}; \ j = 1,2, ..., P \end{equation}\]
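A toy illustration of how the candidate set \(\boldsymbol{C}\) is enumerated: one reflected hinge pair per (predictor, observed value) combination. The data and names here are made up for illustration, not from the fev example:

```r
# Enumerate the MARS candidate set C: for each predictor j and each
# observed value t = x_ij, the reflected pair ((X_j - t)_+, (t - X_j)_+).
pos <- function(z) pmax(0, z)

X <- data.frame(x1 = c(1, 2, 3), x2 = c(10, 20, 30))

cand <- list()
for (j in names(X)) {
  for (t in unique(X[[j]])) {
    key <- sprintf("%s, t=%g", j, t)
    # each candidate is a pair of basis columns evaluated at the data
    cand[[key]] <- cbind(right = pos(X[[j]] - t), left = pos(t - X[[j]]))
  }
}

length(cand)  # 2 predictors x 3 distinct values each = 6 reflected pairs
```

With \(N\) observations and \(P\) predictors, there are up to \(N \times P\) candidate pairs, which is why the forward search stays tractable only with a greedy strategy.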
“At each stage we consider all products of a candidate hinge in \(\boldsymbol{C}\) with a hinge in the model \(\boldsymbol{M}\). The product that decreases the residual error the most is added into the current model.” (Hastie et al. 2009)
Thus at each step, it's possible to add either:
- a reflected hinge pair on its own (its product with the constant term), or
- a reflected hinge pair multiplied by a term already in the model (creating an interaction)
Example first 3 steps:
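The kind of search involved can be sketched in base R. This is a simplified illustration with one predictor and simulated data, not `earth`'s actual algorithm (which also considers products with terms already in the model):

```r
# One forward step: try every observed value as a knot, fit the reflected
# hinge pair by least squares, and keep the knot with the lowest RSS.
set.seed(1)
x <- runif(100)
y <- 2 * pmax(0, x - 0.5) + rnorm(100, sd = 0.1)  # true knot at 0.5

pos <- function(z) pmax(0, z)

best <- NULL
for (t in x) {
  fit <- lm(y ~ pos(x - t) + pos(t - x))
  rss <- sum(resid(fit)^2)
  if (is.null(best) || rss < best$rss) best <- list(knot = t, rss = rss)
}

best$knot  # should land near the true knot, 0.5
```

Real MARS repeats this greedy step, considering all (candidate pair) x (existing term) products, until `nk` terms are reached.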
The term whose removal causes the smallest increase in residual squared error is deleted from the model at each stage, producing an estimated best model \(f_\lambda\) of each size (number of terms) \(λ\) (Hastie et al. 2009)
Thus the best models of size 1, 2, …, `nprune` terms are identified
Best? Measured by fast generalized cross-validation (GCV) or more accurate but slower K-fold CV.
GCV provides a convenient approximation to leave-one-out cross-validation for linear models, without needing to split/resample/refit data (Hastie et al. 2009).
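As a sketch of the GCV criterion (following Hastie et al. 2009, where the effective number of parameters is \(M(\lambda) = r + cK\) with \(r\) model terms, \(K\) knots, and penalty \(c = 3\), or \(c = 2\) for additive models):

```r
# GCV(lambda) = mean(residual^2) / (1 - M(lambda)/N)^2
# Larger M (more terms/knots) inflates the criterion, penalizing complexity.
# This is a sketch of the formula, not earth's internal implementation.
gcv <- function(residuals, n.terms, n.knots, c = 3) {
  N <- length(residuals)
  M <- n.terms + c * n.knots  # effective number of parameters
  mean(residuals^2) / (1 - M / N)^2
}

# Sanity check: with zero effective parameters, GCV is just the MSE
gcv(c(1, -1, 1, -1), n.terms = 0, n.knots = 0)  # 1
```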
Potential tuning parameters:

- `degree`: degree of interactions allowed (set to 1 for none)
- `nprune`: maximum number of terms in the final (pruned) model
- `nk`: maximum number of terms in the forward pass

Simplest tuning strategy: set `degree` to a moderate value like 5 and use the default `nk` and `nprune`.

Medium tuning strategy: additionally tune `nprune`.

Advanced tuning strategy: tune `degree`, `nk` and use K-fold CV to optimize.

Fitting with the `earth` R package:

```r
library(earth);

fit.gcv <- earth(
    formula = fev ~ .,
    data = fev,
    degree = 5,
    keepxy = TRUE
    );
print(fit.gcv);
```

```
Selected 6 of 17 terms, and 4 of 4 predictors
Termination condition: Reached nk 21
Importance: height.inches, sexMale, age, smokeYes
Number of terms at each degree of interaction: 1 3 2
GCV 0.1487769    RSS 93.32458    GRSq 0.8024061    RSq 0.8098985
```
GRSq normalizes GCV from 0 to 1, similar to adjusted \(R^2\).

```r
set.seed(123);

fit.cv <- earth(
    formula = fev ~ .,
    data = fev,
    degree = 5,
    keepxy = TRUE,
    pmethod = 'cv',
    nfold = 10,
    ncross = 5
    );
print(fit.cv);
```

```
Selected 5 of 17 terms, and 3 of 4 predictors (pmethod="cv")
Termination condition: Reached nk 21
Importance: height.inches, sexMale, age, smokeYes-unused
Number of terms at each degree of interaction: 1 3 1
GRSq 0.8013863  RSq 0.8074228  mean.oof.RSq 0.7912086 (sd 0.049)

pmethod="backward" would have selected:
    6 terms 4 preds,  GRSq 0.8024061  RSq 0.8098985  mean.oof.RSq 0.7877625
```
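The GRSq values reported above can be read as follows; a sketch of the definition, not `earth`'s internal code:

```r
# GRSq rescales a model's GCV against the GCV of an intercept-only (null)
# model, so it reads like an R^2: 1 is a perfect fit, 0 is no better than
# the null model, and negative values are worse than the null model.
grsq <- function(gcv.model, gcv.null) {
  1 - gcv.model / gcv.null
}

grsq(gcv.model = 0.2, gcv.null = 1.0)  # 0.8
```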
Let’s explore the predictor effects from fit.gcv:
Selected hinge functions and \(\hat{\beta}\):
```
                                     fev
(Intercept)                   2.74759130
h(65-height.inches)          -0.09188745
h(age-8)                      0.08327821
h(height.inches-65)*sexMale   0.24942931
h(height.inches-68)          -0.14391813
h(age-8)*smokeYes            -0.02834601
```
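One way to see that these are ordinary regression coefficients on hinge bases: reproduce a fitted value by hand. The coefficients are copied from the output above; the `predict_fev` helper is illustrative only:

```r
# h() is the hinge function printed by earth: h(z) = max(0, z)
h <- function(z) pmax(0, z)

# Manual prediction from the selected terms and their coefficients
predict_fev <- function(height.inches, age, sexMale, smokeYes) {
  2.74759130 +
    -0.09188745 * h(65 - height.inches) +
     0.08327821 * h(age - 8) +
     0.24942931 * h(height.inches - 65) * sexMale +
    -0.14391813 * h(height.inches - 68) +
    -0.02834601 * h(age - 8) * smokeYes
}

# e.g. a 10-year-old non-smoking girl who is 60 inches tall
predict_fev(height.inches = 60, age = 10, sexMale = 0, smokeYes = 0)  # ~2.45
```

Note how only the hinges whose knots are "active" at a given input contribute, which is the local, parsimonious behavior quoted from Hastie et al. above.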
Variable importance scores:
Partial dependence plots:
`plotmo()` can be used to plot the estimated effects; see also the `pdp` R package for making similar plots.

`earth` can be used for continuous, count, binary or multinomial outcomes.

`caret`'s bagged MARS can improve prediction performance at the expense of interpretability.

MARS automatically handles:
- variable selection
- nonlinear relationships (knot locations)
- interactions (when `degree` > 1)
Other benefits:
Downsides?